This is a placeholder. Final title will be filled later

Authors

  • Stéphane Dupont
  • Christophe Ris

Abstract

This paper summarizes some of the robust feature extraction and acoustic modeling technologies used at Multitel, together with their assessment on some of the ETSI Aurora reference tasks. Ongoing work and directions for further research are also presented. For feature extraction (FE), we use PLP coefficients. Additive and convolutional noise are addressed using a cascade of spectral subtraction and temporal trajectory filtering. For acoustic modeling (AM), artificial neural networks (ANNs) are used to estimate the HMM state probabilities. At the junction of FE and AM, the multi-band structure provides a way to address robustness at both processing levels. Robust features within sub-bands can be extracted using a form of discriminant analysis. In this work, this is obtained using sub-band ANN acoustic models. The robust sub-band features are then used for the estimation of state probabilities. These systems are evaluated on the Aurora tasks in comparison to the existing ETSI features. Our baseline system performs similarly to the ETSI advanced features coupled with the HTK back-end. On the Aurora 3 tasks, the multi-band system outperforms the best ETSI results, with an average reduction of the word error rate of about 62% with respect to the baseline ETSI system and of about 18% with respect to the advanced ETSI system. This confirms previous positive experience with the multi-band architecture on other databases.
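
As a rough illustration of the cascade mentioned in the abstract, the sketch below applies a textbook form of spectral subtraction (for additive noise) followed by per-band mean removal of log-spectral trajectories (a very simple stand-in for temporal trajectory filtering, aimed at convolutional noise). It is not the authors' implementation: the function names, the use of the first few frames as a noise estimate, and the parameters `noise_frames`, `alpha`, and `floor` are all assumptions made for this example.

```python
import numpy as np

def spectral_subtraction(power_spec, noise_frames=10, alpha=1.0, floor=0.01):
    """Subtract an estimated noise spectrum from a (frames, bins) power spectrogram.

    Assumption: the first `noise_frames` frames contain noise only.
    """
    noise = power_spec[:noise_frames].mean(axis=0)
    cleaned = power_spec - alpha * noise
    # Spectral floor: never go below a small fraction of the noisy power.
    return np.maximum(cleaned, floor * power_spec)

def trajectory_filter(log_spec):
    """Remove the time-average of each band's log-energy trajectory.

    A fixed channel gain is a constant offset in the log domain,
    so subtracting the per-band mean cancels it.
    """
    return log_spec - log_spec.mean(axis=0, keepdims=True)

# Usage on synthetic data (hypothetical shapes: 50 frames, 8 bands).
rng = np.random.default_rng(0)
spec = rng.uniform(1.0, 2.0, size=(50, 8))
features = trajectory_filter(np.log(spectral_subtraction(spec)))
```

Mean removal is only the simplest possible trajectory filter; band-pass filters of the RASTA type are a common alternative, and the abstract does not say which filter is used.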

Similar resources

This is a placeholder. Final title will be filled later

This paper presents a model to predict the phrase commands of the Fujisaki model of the F0 contour for the Portuguese language. The location of phrase commands in the text is governed by a set of weighted rules. The amplitude (Ap) and timing (T0) of the phrase commands are predicted by separate neural networks. The features for both neural networks are discussed. Finally a comparison between target and predi...

This is a placeholder. Final title will be filled later

Solutions proposed in the literature for allocating multiple users to the sub-bands of an OFDM system with multiple antennas require high computational effort and assume delay-insensitive applications. Our approach aims to overcome these limitations by relaxing some hypotheses in order to give a feasible solution. The proposed algorithm can be applied to a real multiple antenna OFDM sy...

TODO: This is a placeholder. Final title will be filled later

Classification performance for emotional user states found in the few realistic, spontaneous databases available is as yet not very high. We present a database with emotional children’s speech in a human-robot scenario. Baseline classification performance for seven classes is 44.5%, for four classes 59.2%. We discuss possible strategies for tuning, e.g., using only prototypes (based on annotati...

TODO: This is a placeholder. Final title will be filled later

We report work on mapping the acoustic speech signal, parametrized using Mel Frequency Cepstral Analysis, onto electromagnetic articulography trajectories from the MOCHA database. We employ the machine learning technique of Support Vector Regression, contrasting previous works that applied Neural Networks to the same task. Our results are comparable to those older attempts, even though, due to ...

This is a placeholder. Final title will be filled later

Recent auditory physiological evidence points to a modulation frequency dimension in the auditory cortex. This dimension exists jointly with the tonotopic acoustic frequency dimension. Thus, audition can be considered as a relatively slowly-varying two-dimensional representation, the “modulation spectrum,” where the first dimension is the well-known acoustic frequency and the second dimension i...

This is a placeholder. Final title will be filled later

Sine-wave speech (SWS) is a three-tone replica of speech, conventionally created by matching each constituent sinusoid in amplitude and frequency with the corresponding vocal tract resonance (formant). We propose an alternative technique where we take a high-quality multicomponent sinusoidal representation and decimate this model so that there are only three components per frame. In contrast to...

Publication date: 2003